Preserve non-missingness during non-inner joins #1316

alyst · 2017-12-13T01:37:05Z

Currently, if you do a few joins, most of the columns in your data frame become "missable" (>: Missing).
That's because so far join preserves non-missingness only for inner joins:

inner join doesn't introduce new missing values => non-missingness of columns from both left and right frames should be preserved.

This PR adds the other logical rules:

- left join doesn't introduce new missing values in the left columns => non-missingness of the left frame should be preserved
- right join doesn't introduce new missing values in the right columns => non-missingness of the right frame should be preserved
- joins do not introduce missing values to the on-columns => non-missingness of the on-columns should be preserved if both left-on and right-on columns are non-missing
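
A minimal sketch of how the rules above could be expressed purely in terms of the join kind and the input element types. The helper names (`allowsmissing`, `result_allowsmissing`, `on_allowsmissing`) are hypothetical and are not taken from the PR or from DataFrames:

```julia
# Hypothetical helpers, not part of DataFrames: decide from the join kind and
# the input eltype alone whether an output column must allow missing values.
allowsmissing(T::Type) = T >: Missing

# `side` is :left or :right, i.e. the frame the (non-on) column comes from.
function result_allowsmissing(kind::Symbol, side::Symbol, T::Type)
    allowsmissing(T) && return true          # already allowed missing in the source
    kind == :inner && return false           # inner join adds no missing values
    kind == :left  && return side != :left   # left columns stay as they are
    kind == :right && return side != :right  # right columns stay as they are
    return true                              # :outer may add missing on either side
end

# On-columns stay non-missing only if both inputs are non-missing.
on_allowsmissing(LT::Type, RT::Type) = allowsmissing(LT) || allowsmissing(RT)
```

For example, `result_allowsmissing(:left, :right, Int)` is `true`, while `result_allowsmissing(:left, :left, Int)` is `false`.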

The rules above are "type-stable", so to speak, i.e. they don't depend on the contents of the frames.
But I thought it also makes sense to introduce data-dependent rules (there's no performance penalty in checking them):

- if all right rows have matching left rows (i.e. no missing values are introduced to the left columns) => non-missingness of the left columns is preserved
- if all left rows have matching right rows (i.e. no missing values are introduced to the right columns) => non-missingness of the right columns is preserved
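
To illustrate what "data-dependent" means here, a small sketch using only Base (`nonmissingtype` assumes Julia 1.3 or later). The function `narrow` is a hypothetical stand-in, not code from the PR: it returns a different vector type depending on the values, which is the property the type-stable rules avoid.

```julia
# Hypothetical: drop Missing from the eltype only when the data contains no
# missing values, so the returned type depends on the contents.
narrow(v::AbstractVector) =
    any(ismissing, v) ? v : convert(Vector{nonmissingtype(eltype(v))}, v)

a = narrow(Union{Int,Missing}[1, 2, 3])        # no missing values present
b = narrow(Union{Int,Missing}[1, missing, 3])  # a missing value is present

println(eltype(a))  # Int64 on a 64-bit system
println(eltype(b))  # still allows Missing
```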

As a reasonable side effect of the rules implementation, when doing the right join, the eltype (and levels, for categorical arrays) of the right on-column has priority over the left on-column.

alyst · 2017-12-13T02:48:02Z

The Travis Mac CI job stalled during the Julia image download; the other tests pass.

cjprybol · 2017-12-13T04:18:24Z

Last time we considered this we voted against it because we were still using NullableArrays and we wanted to consistently interact with the columns using get and f.() dot-broadcasting, right? But now we're using the Missings approach and will interact with columns the same way whether they have missing data or not. And there will be downstream performance gains (at least in the near future) if we can avoid introducing the Union{T, Missing} and just keep T, so the extra work in join probably makes sense in many situations. Thanks for putting together this PR!

nalimilan

Thanks. I agree about the "type-stable" part, but I'm not a fan of changing the returned type depending on the data. Apart from making the behavior hard to predict, it won't work with a fully type-stable variant of DataFrame that I've considered supporting as an alternative. Do you really think it's common to make a left/right join without any missing values? The point of these operations is that you expect some rows to be missing from the second table, or you can use an inner join instead.

nalimilan · 2017-12-13T13:06:07Z

src/abstractdataframe/join.jl

-    _similar = kind == :inner ? similar : similar_missing
+    # inner and left joins preserve non-missingness of the left frame
+    # it is also preserved if all right rows have left matches
+    _similar_left = kind == :inner || kind == :left || length(rightonly_ixs.join) == 0 ? similar : similar_missing


Since the condition is very long, better use if ...; _similar_left = ...
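
A sketch of the suggested restructuring, written as a standalone function so it can be run in isolation. `similar_missing` below is an assumed stand-in for the internal helper named in the diff, and `pick_similar_left` and `n_rightonly` (standing for `length(rightonly_ixs.join)`) are hypothetical names:

```julia
# Assumed stand-in for the internal helper: allocate a vector that allows missing.
similar_missing(v::AbstractVector, n::Int) =
    fill!(similar(v, Union{eltype(v), Missing}, n), missing)

# Same condition as in the diff, spelled out as an if/else block.
function pick_similar_left(kind::Symbol, n_rightonly::Int)
    if kind == :inner || kind == :left || n_rightonly == 0
        # inner and left joins preserve non-missingness of the left frame;
        # so does any join in which every right row has a left match
        return similar
    else
        return similar_missing
    end
end
```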

nalimilan · 2017-12-13T13:06:41Z

src/abstractdataframe/join.jl

+        on_col_ix = findfirst(joiner.left_on, names(joiner.dfl)[i])
+        if on_col_ix > 0 && kind == :right
+            # if right join, construct the on-column
+            # using the right frame to preserve missingness and cat.levels


What's "cat.levels"?

categorical levels :) I'll expand it

nalimilan · 2017-12-13T13:09:17Z

src/abstractdataframe/join.jl

+            rcol = joiner.dfr_on[on_col_ix]
+            cols[i] = similar(rcol, nrow)
+            copy!(cols[i], view(rcol, all_orig_right_ixs))
+            permute!(cols[i], right_perm)


Isn't there an alternative solution which doesn't involve calling permute!? I'm concerned about the performance impact compared with the previous code which simply called copy!.

This is the case only for the right joins. Previously it was handled later in the code (if length(rightonly_ixs) > 0 block).
Actually, now it should be faster, because we compose the resulting column using the right frame directly, instead of initializing it using the left frame and fixing it later.

alyst · 2017-12-13T13:39:53Z

> I'm not a fan of changing the returned type depending on the data. Apart from making the behavior hard to predict, it won't work with a fully type-stable variant of DataFrame that I've considered supporting as an alternative. Do you really think it's common to make a left/right join without any missing values? The point of these operations is that you expect some rows to be missing from the second table, or you can use an inner join instead.

The problem of the inner join is that it "swallows" the rows that do not have matches, and checking for that is more expensive (performance- and code size-wise) than the introduction of missing values and checking the resulting column type.
I'm also a little bit concerned about "type-instability". I was wondering whether we could introduce a switch to join() that specifies the missingness behaviour (I don't have a good name for it ATM):

- type-stable: no data-dependent rules
- with data-dependent rules
- strict: the original non-missingness is preserved, so if we are doing e.g. a left join and some left rows have no matching right rows, but some right columns don't allow missing values, join() throws MissingException (so that's an inner join with no "swallowed" rows allowed)
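
A rough sketch of the check that the "strict" mode describes, applied to a single result column. `strict_column` is a hypothetical helper and not a proposed API; `MissingException` is the exception type from Base:

```julia
# Hypothetical: fail if a column whose source eltype did not allow missing
# values ended up containing missing values after the join.
function strict_column(col::AbstractVector, source_eltype::Type)
    if !(source_eltype >: Missing) && any(ismissing, col)
        throw(MissingException(
            "join introduced missing values into a column that does not allow them"))
    end
    return col
end
```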

nalimilan · 2017-12-13T16:32:00Z

> The problem of the inner join is that it "swallows" the rows that do not have matches, and checking for that is more expensive (performance- and code size-wise) than the introduction of missing values and checking the resulting column type.

It's not an issue if we use an Array{Union{T, Missing}} internally, as long as we take care of calling disallowmissing before returning it.

> I'm also a little bit concerned about "type-instability". I was thinking whether we can introduce some switch to join() that specifies the missingness behaviour (don't have a good name for it ATM):
> - type-stable: no data-dependent rules
> - with data-dependent rules
> - strict: the original non-missingness is preserved, so if we are doing e.g. left join and some left rows have no matching right rows, but some right columns don't allow missing values, join() throws MissingException (so that's inner join with no "swallowed" rows allowed)

Given that we have disallowmissing!, I don't really see the advantage of providing a keyword argument to achieve the same result. And the "strict" case sounds unusual to me, as left/right joins are supposed to generate missing values anyway. I don't think any other software supports this kind of option, which could be confusing.

alyst · 2017-12-13T17:41:42Z

> It's not an issue if we use an Array{Union{T, Missing}} internally, as long as we take care of calling disallowmissing before returning it.

I'm not sure I follow. By "internally" you mean within join()?
Or do you mean calling disallowmissing() after join() (from the user code)?
That's what I'm doing right now (after the left join) in my scripts to verify that each left row has a matching right row.
That's fine for the check (it requires just one right column), but if I want all non-missing right columns to stay non-missing, that takes a few lines of code specific to the data schema.

> Given that we have disallowmissing!, I don't really see the advantage of providing a keyword argument to achieve the same result.

See the usecase above, the keyword is equivalent to applying disallowmissing to all relevant columns (that were not of type Union{T,Missing} in the source frame and do not contain missing in the result).
I considered whether it could be implemented as a separate method, but it would require all 3 frames (left, right, result), so using it would look like ugly join postprocessing.
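
A sketch of what such postprocessing might look like, and why it needs the source frames as well as the result. To stay self-contained it uses plain `Dict`s of vectors instead of data frames. `restore_nonmissing!` is a hypothetical name, and `nonmissingtype` assumes Julia 1.3 or later:

```julia
# Hypothetical post-processing: narrow every result column that was
# non-missing in a source frame and contains no missing values after the join.
function restore_nonmissing!(result::AbstractDict, sources::AbstractDict...)
    for src in sources, (name, col) in src
        haskey(result, name) || continue
        eltype(col) >: Missing && continue   # source column already allowed missing
        rescol = result[name]
        any(ismissing, rescol) && continue   # the join really introduced missing values
        result[name] = convert(Vector{nonmissingtype(eltype(rescol))}, rescol)
    end
    return result
end

left   = Dict{Symbol,AbstractVector}(:id => [1, 2], :x => [10, 20])
right  = Dict{Symbol,AbstractVector}(:id => [1, 2], :y => ["a", "b"])
joined = Dict{Symbol,AbstractVector}(
    :id => Union{Int,Missing}[1, 2],
    :x  => Union{Int,Missing}[10, 20],
    :y  => Union{String,Missing}["a", "b"])

restore_nonmissing!(joined, left, right)
```

After the call, all three columns of `joined` have element types without `Missing`, because none of the source columns allowed missing values and the result contains none.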

> And the "strict" case sounds unusual to me, as left/right joins are supposed to generate missing values anyway. I don't think any other software supports this kind of option, which could be confusing.

In SQL you have "INSERT INTO tbl SELECT ...", which checks that the result of SELECT fits the constraints of "tbl".
In data frame world we don't have data schemas, so the strict option is just a poor man's replacement (the result fits the constraints imposed by the source frames).
But I agree it would be confusing. Actually, since it's just inner join + the check that each row has a match, an alternative is to provide it as a new join kind (kind=:strict).

EDIT: I was wrong that "strict" is just a specific type of "inner" join. There could also be "strict left" (all left rows should have matching right rows) and "strict right" joins.

alyst · 2017-12-14T10:12:48Z

I've updated the commit removing the data-dependent rules.
It looks like there is consensus regarding the rest of the PR, so it doesn't make sense to block it.
Data-dependent rules deserve a separate PR (I'll submit it as soon as this one is merged) and separate discussion.

alyst · 2017-12-14T10:31:03Z

CI passes on 0.6

nalimilan

Thanks. Indeed, better to merge the uncontroversial parts first and discuss the more problematic ones later.

nalimilan · 2017-12-14T14:24:01Z

src/abstractdataframe/join.jl

        for (on_col_ix, on_col) in enumerate(joiner.left_on)
            # fix the result of the rightjoin by taking the nonmissing values from the right table
            offset = nrow - length(rightonly_ixs.orig) + 1
            copy!(res[on_col], offset, view(joiner.dfr_on[on_col_ix], rightonly_ixs.orig))
        end
    end
+    if kind == :outer && !isempty(rightonly_ixs.join)
+        # some non-missing on-columns may have become missing


"missing column" can be confusing as it's not clear whether it's the column which might be missing. Better say something like "columns allowing for missing" even if it's a bit long.

Will fix. In the long run it would be nice to have a term for it (even an ugly internal one, to be used in cases like this).

nalimilan · 2017-12-14T14:29:47Z

src/abstractdataframe/join.jl

+        for (on_col_ix, on_col) in enumerate(joiner.left_on)
+            LT = eltype(joiner.dfl_on[on_col_ix])
+            RT = eltype(joiner.dfr_on[on_col_ix])
+            if Missings.T(LT) === LT && Missings.T(RT) === RT


Couldn't this just be !(LT >: Missing || RT >: Missing)? That's more explicit IMHO. It could also be worth adding a special case for Any columns, for which there's no point in calling disallowmissing since it won't have any effect.

I did it this way because Any >: Missing. But you're right, in this context we don't need to fix Any columns. Will fix.

But Missings.T(Any) === Any too.

Exactly, so the code would have called disallowmissing, which is wrong.
With the new condition, !(Any >: Missing || T >: Missing) evaluates to false, so disallowmissing would not be called.

I've added a testset for joins with Any eltypes.
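
The difference between the two conditions for `Any` columns can be checked directly. This sketch uses `nonmissingtype` from Base (Julia 1.3 or later) in place of the `Missings.T` call shown in the diff; `old_check` and `new_check` are hypothetical names for the two variants:

```julia
# Original condition from the diff, with Missings.T replaced by nonmissingtype.
old_check(LT::Type, RT::Type) =
    nonmissingtype(LT) === LT && nonmissingtype(RT) === RT

# Condition suggested in the review.
new_check(LT::Type, RT::Type) = !(LT >: Missing || RT >: Missing)

println(old_check(Any, Int))  # true: Any is wrongly treated as non-missing
println(new_check(Any, Int))  # false: Any >: Missing, so the column is left alone
println(new_check(Int, Int))  # true: both on-columns are non-missing
```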

nalimilan · 2017-12-14T14:32:32Z

test/join.jl

@@ -346,8 +343,8 @@ module TestJoin
        @test levels(join(B, A, on=:b, kind=:inner)[:b]) == ["a", "b", "c"]
        @test levels(join(A, B, on=:b, kind=:left)[:b]) == ["d", "c", "b", "a"]
        @test levels(join(B, A, on=:b, kind=:left)[:b]) == ["a", "b", "c"]
-        @test levels(join(A, B, on=:b, kind=:right)[:b]) == ["d", "c", "b", "a"]
-        @test levels(join(B, A, on=:b, kind=:right)[:b]) == ["a", "b", "d", "c"]
+        @test levels(join(A, B, on=:b, kind=:right)[:b]) == ["a", "b", "c"]


Why does this change? The docstring for join gives guarantees about the levels, so either it wasn't correct before or it might need to be updated.

In this PR the types and categorical levels of the right on-columns have priority over the left ones.
I think it's a logical thing. Also it's easier to implement.
So I can either patch the PR to the previous behaviour or update the docstring.

As long as it's not too hard to implement, I'd choose the rule which is the simplest to explain in the docstring, which is a good indication that it's easy to understand. IIUC, the new behavior you propose is to give priority to the left column, except for kind=:right? I'm not a big fan of exceptions, even if I see the justification.

I've reverted to the old category levels ordering.

alyst · 2017-12-18T12:38:11Z

squashed, in case you want to merge it preserving the commits

cjprybol approved these changes Dec 13, 2017

alyst mentioned this pull request Dec 13, 2017

Test non-matching joins #1317

Merged

nalimilan requested changes Dec 13, 2017

alyst force-pushed the join_coltypes branch from 76044d5 to 21582d6 Compare December 14, 2017 10:07

alyst force-pushed the join_coltypes branch from 21582d6 to 16379ef Compare December 14, 2017 10:13

nalimilan reviewed Dec 14, 2017

nalimilan approved these changes Dec 14, 2017

alyst added 2 commits December 18, 2017 13:35

join: try harder to preserve non-missingness

d01bd20

- left and right joins preserve non-missingness of the left and right frame columns, respectively
- non-missingness of the on-columns is preserved (if non-missing in both frames)

add join() test for frames with Any column

c6bc53b

alyst force-pushed the join_coltypes branch from 568b985 to c6bc53b Compare December 18, 2017 12:37

nalimilan merged commit 6a1fa20 into JuliaData:master Dec 18, 2017

nalimilan mentioned this pull request Mar 17, 2018

Don't make join output DataValueArray unless there are NAs JuliaData/IndexedTables.jl#121

Merged
